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All cancers harbor molecular alterations in their genomes. The transcriptional consequences of these somatic mutations 
have not yet been comprehensively explored in lung cancer. Here we present the first large scale RNA sequencing study of 
lung adenocarcinoma, demonstrating its power to identify somatic point mutations as well as transcriptional variants such 
as gene fusions, alternative splicing events, and expression outliers. Our results reveal the genetic basis of 200 lung 
adenocarcinomas in Koreans including deep characterization of 87 surgical specimens by transcriptome sequencing. We 
identified driver somatic mutations in cancer genes including EGFR, KRAS, NRAS, BRAF, PIK3CA, MET, and CTNNBl, Candidates 
for novel driver mutations were also identified in genes newly implicated in lung adenocarcinoma such as LMTK2,ARID1A, 
N0TCH2, and SMARCA4. We found 45 fusion genes, eight of which were chimeric tyrosine kinases involving AL/C, RET, ROSl, 
FGFR2, AXL, and PDGFRA, Among 17 recurrent alternative splicing events, we identified exon 14 skipping in the proto- 
oncogene MET as highly likely to be a cancer driver. The number of somatic mutations and expression outliers varied 
markedly between individual cancers and was strongly correlated with smoking history of patients. We identified genomic 
blocks within which gene expression levels were consistently increased or decreased that could be explained by copy 
number alterations in samples. We also found an association between lymph node metastasis and somatic mutations in 
TP53. These findings broaden our understanding of lung adenocarcinoma and may also lead to new diagnostic and 
therapeutic approaches. 

[Supplemental material is available for this article.] 



Lung cancer is one of the most common cancers in humans, as well 
as the leading cause of cancer-related death worldwide (Jemal et al. 
2011). Although diagnosis at an early stage is increasing with the 
introduction of low-dose computerized tomography screening, 
lung cancer is still a devastating disease that has a very poor 
prognosis (Aberle et al. 2011). Lung cancer can be classified based 
on histopathologic findings, with adenocarcinoma being the most 
common type (Travis et al. 2005). Recently deeper understanding 
of the major genetic alterations and signaling pathways involved 
has suggested a reclassification of lung adenocarcinoma based 
on underlying driver mutations. Cancer cells with these genetic 
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alterations have survival and growth advantages over cells without 
such changes (Haber and Settleman 2007). Currently, approxi- 
mately 10 driver genes have been discovered in lung adenocarci- 
noma (Pao and Girard 2011). Clinical trials using new chemo- 
therapeutic agents targeting such alterations have demonstrated 
remarkable improvements in patient outcome, for example 
gefitinib (Mok et al. 2009; Maemondo et al. 2010) and crizotinib 
(Kwak et al. 2010) for lung adenocarcinoma harboring EGFR mu- 
tations and £ML4-ALiC (Soda et al. 2007) fusion, respectively. More 
recently not only point mutations but also tyrosine kinase gene 
fusions, such as KIF5B-RET, were identified as driver mutations (Ju 
et al. 2012). Nevertheless, we still do not know the molecular 
drivers of —40% of lung adenocarcinomas (Pao and Girard 2011). 
Interestingly, the frequencies of some driver mutations have 
been shown to be significantly different between ethnic groups 
(Shigematsu and Gazdar 2006), and therefore comprehensive 
cancer genome studies in a range of human populations will help 
to find new molecular alterations that can be targeted in treat- 
ments of lung cancer. 
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In this study, we broadly surveyed genetic alterations in 200 
fresh surgical specimens of lung adenocarcinoma in Koreans. 
Eighty-seven of these were analyzed by transcriptome sequencing 
combined with whole-exome (n = 76) and transcriptome se- 
quencing (n = 77) for matched adjacent normal tissue samples. 
Transcriptome sequencing is a powerful method for detecting 
driver mutations in cancer, since not only somatic point mutations 
but also aberrant RNA variants such as fusion genes and alternative 
splicing can be examined. Though advances in genomic technolo- 
gies have enabled genome-wide analyses of cancers, to our knowl- 
edge this is the first large-scale study of lung adenocarcinoma using 
RNA sequencing. 

Results 

Cancer samples analyzed in this study 

We collected 200 fresh surgical specimens of primary lung ade- 
nocarcinoma from patients who underwent major lung resection 
(Supplemental Fig. 1). For each patient, we recorded diagnosis, 
gender, cancer stage, and smoking status (Supplemental Table 1). 
Among the 200 cancer patients, the proportions of females and 
never-smokers were 54.5% {n = 109) and 58.0% (n = 116), re- 
spectively. Of these, 87 cancer tissues whose driver mutations were 
not detected by screening tests (Sanger sequencing for EGFR and 
KRAS point mutations and fluorescence in situ hybridization 
[FISH] for £ML4-ALiC fusion) (Supplemental Table 1; Supplemental 
Fig. 1) were analyzed by transcriptome sequencing combined with 
whole-exome {n = 76) and transcriptome {n = 77) sequencing 
of matched normal lung tissue samples (Supplemental Table 2; 
Supplemental Material). All these sequencing experiments were 
done as described previously (Ju et al. 2011). We generated 
14,038,673,860 paired-end 101-bp-long reads from RNA sequenc- 
ing of 164 samples (87 cancer and 77 corresponding normal tis- 
sues). On average, the RNA sequencing throughputs were 9.77 and 
7.38 Gbp for cancer and normal tissues, respectively. In the whole- 
exome sequencing of normal tissues, we obtained 32.96-fold read 
depth per tissue for regions targeted by the exome capture platform 
used in this study. 

Somatic point mutations 

Using our transcriptome data, we identified 4607 somatic non- 
synonymous single nucleotide substitutions and 373 coding short- 
indel mutations (Supplemental Fig. 2; Supplemental Table 3). 
Whole-exome sequencing of two randomly selected cancer sam- 
ples provided an estimate of 89.2% for the accuracy of our somatic 
mutation discovery (Supplemental Table 4). The median number 
of somatic mutations in each cancer sample was 25. Among a total 
of 87 samples, 45 carried driver mutations in well-known cancer 
genes in lung adenocarcinoma, such as EGFR (n = 22; in-frame 
deletions of exon 19, L858R and G719A) and KRAS (n = 18; G12C, 
G12V, G12D, G12S, G13C, andG13D) (Fig. 1; Supplemental Table 
3). In addition to EGFR and KRAS, other known mutations were 
also detected in NRAS (Sequist et al. 2011) {n = 3; Q61H, Q61L, 
Q61K), PIK3CA (Ding et al. 2008) (n = 2; H1047R and E555K), and 
BRAF (Ding et al. 2008) {n = 1; V600E), which have all been 
reported as driver mutations of lung cancer. In addition, two 
samples carried known activating mutations in well-known on- 
cogenes confirmed in other cancers (D32G of CTNNBl [Chan et al. 
1999], M1124D of MET [Schmidt et al. 1999]), suggesting those 
mutations are also able to induce lung adenocarcinoma. Overall, 



47 specimens harbored known point driver mutations in seven 
cancer genes (EGFR, KRAS, NRAS, PIK3CA, BRAF, CTNNBl, and 
MET), which we here refer to as "canonical point driver muta- 
tions." These mutations were mutually exclusive with one excep- 
tion of 1 EGFR and PIK3CA double-mutant. In addition to these 
known driver mutations, we also identified a set of genes, which 
were frequently mutated or highly overexpressed in a subset of 
cancers. Of note, TP53 was the most frequently mutated gene. 
CDKN2A, RET, N0TCH2, SMARCA4, LMTK2, ARIDIA, and MTOR 
were also frequently altered and are also worthy of note, since the 
functions of these genes are likely to be related to tumorigenesis or 
cancer maintenance (Fig. 1). The pairwise mutual exclusion and 
concurrence analysis for these mutated genes is shown in Sup- 
plemental Table 5. 

Fusion gene analysis 

One of the advantages of transcriptome sequencing over genome 
sequencing is that detection of transcriptional variants, such as 
fusion genes and alternative splicing, is feasible. Using the gene 
fusion program (GFP) introduced previously (Ju et al. 2012) and 
typical gene expression patterns, we identified 45 in-frame fusion 
transcripts from the 87 cancer tissues (Fig. 2; Table 1; Supplemental 
Fig. 3; Supplemental Table 6). We attempted to validate all of the 
fusion genes using PGR amplification of cDNA and Sanger se- 
quencing (Supplemental Fig. 4; Supplemental Table 7). Among 43 
fusion genes where PGR primer was available, 39 were successfully 
validated (Supplemental Table 7). Interestingly, the four invali- 
dated fusion genes all included either a surfactant gene (i.e., 
SFTPB or SFTPA2) or H19, whose wild-type gene expression was 
extremely high (greater than —2000 RPKM) in the corresponding 
specimen. This indicates that the fusion transcript may have been 
artificially synthesized during sequencing library construction. 

Of the fusion genes we identified, eight were chimeric tyro- 
sine kinases which are highly likely to play an important role in 
cancer development. Gancer specimens carrying one of the tyro- 
sine kinase fusions (n = 10) did not harbor any of the canonical 
point driver mutations (P-value = 2.12 X 10~^) (Supplemental Ta- 
ble 8). Of these eight fusion genes, four have been reported pre- 
viously (EML4-ALK [Soda et al. 2007], KIF5B-RET [Ju et al. 2012; 
Kohno et al. 2012; Lipson et al. 2012; Takeuchi et al. 2012], CD74- 
ROSl [Rikova et al. 2007; Takeuchi et al. 2012], and SLC34A2-ROS1 
[Rikova et al. 2007; Takeuchi et al. 2012]), and we refer to them here 
as "canonical transforming fusion genes." The remaining four fu- 
sion genes were novel (CCDC6-R0S1, FGFR2-CIT, AXL-MBIP, and 
SCAFll-PDGFRA) (Fig. 3A). Of these four novel fusion genes, 
CCDC6-R0S1 , FGFR2-CIT, and AXL-MBIP carry protein tyrosine 
kinase domains and dimerization units (Alberti et al. 2003; 
Ju et al. 2012) (coiled-coil or leucine zipper domains), both of 
which are essential to activate chimeric tyrosine kinases. The 
SCAFll-PDGFRA fusion is an example of promoter swapping (Kas 
et al. 1997). Because the cancer specimens harboring these four 
novel fusion genes did not carry any known driver mutations 
(neither canonical point driver mutations [n = 47] nor canonical 
transforming fusion genes [n = 6]; P-value = 0.021) (Supplemental 
Table 8), they may play important roles in cancer transformation. 
Other fusion genes identified in this study, such as MAP4K3- 
PRKCE, BCAS3-MAP3K3, ERBB2IP-MAST4, and APLP2-TNFSF11 
may also have functional importance since the genes are serine- 
threonine kinases or involved in signaling pathways. The co-oc- 
currence of 45 fusion genes with canonical point driver mutations 
is shown in Supplemental Table 6. 
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Figure 2. Graphical representation of 45 fusion genes identified from transcriptome sequencing of 
87 lung adenocarcinomas. Protein kinase-containing fusion genes are indicated with red lines joining 
the two genomic loci, while other fusions are indicated by blue lines. The protein kinase genes and their 
fusion partners are labeled in red and green, respectively (outer layer). 



Alternative splicing 

Alternative splicing is known to be related to the pathogenesis 
of colon cancer (Gardina et al. 2006). We assessed exon skipping 
events preferentially occurring in the cancer tissue using the tran- 
scriptome sequencing data. From a total of 87 tissues, we identified 
1 7 recurrent exon skipping events, where a specific exon of a gene is 
not in full transcription in cancer (Supplemental Table 9). In par- 
ticular, three cases of skipping of exon 14 in MET were interesting 
(Fig. 3B), since MET protein tyrosine kinase is a well-known on- 
cogene and this event was reported in lung adenocarcinoma pre- 
viously (Kong-Beltran et al. 2006). Although the number of spec- 
imens harboring MET exon-skipping (n = 3) is not sufficient for 
statistical testing, these three cancer ge- 
nomes did not carry any of the canonical 
point driver mutations (n = 47) or the ca- 
nonical transforming fusion genes {n = 6) 
(P-value = 0.057), suggesting that this 
exon-skipping event may have an in- 
dependent transforming effect in the 
cancer. In addition, three cancer spec- 
imens expressed a short form of FBEN2, 
skipping exon 9. Skipping of exon 9 of 
FBLN2 is also worthy of note (Supple- 
mental Fig. 5), since this gene was re- 
cently introduced as a tumor suppressor 
candidate in nasopharyngeal carcinoma 
(Law et al. 2012). 



Lung adenocarcinoma in smokers 

We compared the transcriptional land- 
scape of lung cancers between ever- 
smokers and never-smokers. There was a 



significant difference in the number of 
point mutations between the two groups 
(Fig. 4A). On average, smokers had signif- 
icantly more amino acid-altering single 
nucleotide and short-indel mutations (65.0 
and 20.6 mutations per cancer tissue of 
smokers [n = 40] and never-smokers [n = 
33], respectively; P-value = 0.0011). In- 
terestingly, the amount of smoking (pack- 
years) was positively correlated with the 
number of somatic point mutations in 
the cancer genome (P-value = 0.01; Sup- 
plemental Fig. 6). We also identified dif- 
ferences in mutational spectrums. Cancer 
tissues from smokers showed similar mu- 
tational signatures to those identified pre- 
viously (Pleasance et al. 2010) (C > A 
transversion, most frequent; T > G trans- 
version, least frequent), whereas cancers 
from never-smokers did not. C > A trans- 
version was more frequent in smokers 
(P-value = 3.1 x 10"^), while T > G trans- 
version was more common in never- 
smokers (P-value = 8.1 x 10"^^) (Fig. 4B). 
In addition, from the gene expression 
profiles (Supplemental Material; Supple- 
mental Table 10; Data Access), we detected 
a total of 6719 cancer outlier genes (COGs), 
which were extremely highly expressed 
in a small number of cancer tissues (Supplemental Table 11). The 
number of COGs per cancer tissue varied markedly (Fig. 4C), 
ranging from 0 to 989. The lung adenocarcinomas of smokers car- 
ried significantly more COGs than those of never-smokers (P-value = 
0.0078) (Fig. 4C,D; Supplemental Fig. 7). These findings demon- 
strate that lung adenocarcinoma in smokers harbors more somatic 
mutations and greater perturbation of gene expression levels. 



Co-Iocalization of over- and underexpressed genes 

Next, we assessed the gene expression pattern of each specific 
cancer specimen relative to the general transcriptional landscape 



Table 1. The list of 15 selected gene fusions identified from 87 lung adenocarcinomas 
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EML4 


ALK 


chr2;chr2 


1 


12.252 


PTK 


KIF5B 


RET 


chrl 0;chr10 


4 


11.227 


PTK 


CD74 


ROSl 


chr5;chr6 




Interchromosomal 


PTK 


SLC34A2 


ROSl 


chr4;chr6 




Interchromosomal 


PTK 


Novel 












CCDC6 


ROSl 


chrl0;chr6 




Interchromosomal 


PTK 


SCAF1 1 


PDGFRA 


chrl2;chr4 




Interchromosomal 


PTK 


FGFR2 


CIT 


chrl 0;chr12 




Interchromosomal 


PTK 


AXL 


MBIP 


chrl 9;chr14 
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PTK 


MAP4K3 


PRKCE 


chr2;chr2 




6.215 


Ser/Thr kinase 


BCAS3 


MAP3K3 
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2.23 


Ser/Thr kinase 


ERBB2IP 


MAST4 


chr5;chr5 




0.515 


Ser/Thr kinase 


KRAS 


CDH13 


chrl 2;chr16 




Interchromosomal 


Signaling 


APLP2 


TNFSF1 1 


chrll;chrl3 




Interchromosomal 


Signaling 


ZFYVE9 


CCA 


chr1;chr6 




Interchromosomal 


Signaling 


TPD52L1 


TRMT1 1 


chr6;chr6 




0.723 


Tumor protein 



(PTK) Protein tyrosine kinase. 
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Figure 3. Fusion genes and alternative splicing events revealed by RNA sequencing. (A) Sche- 
matic figures of the domain structures of novel protein kinase fusion genes. (B) Exon 14 skipping 
in MET proto-oncogene demonstrated by read depth across gene model. (TM) 7rons-membrane 
domain. 



of all 87 cancer tissues. After identifying genes which were rela- 
tively overexpressed and underexpressed in each cancer, we in- 
terestingly observed that these sets were spatially grouped together 
in the genome (Fig. 5 A; Supplemental Fig. 8). We defined those 
regions containing such groups as jointly regulated blocks (JRBs). 
The number of JRBs was highly variable among cancer tissues. In 
order to investigate the cause of these JRBs, we performed com- 
parative genomic hybridization (CGH) array (Park et al. 2010) ex- 
periments on a subset of cancer samples {n = 9). Interestingly 
combined analyses between array results and JRBs showed that 
— 70% of JRBs can be explained by the copy number status of the 
cancer genome (Fig. 5A,B). Recent reports have also shown that 
cancer genomes harbor large hypo-methylated (and hyper-meth- 
ylated) blocks (Wen et al. 2009; Hansen et al. 2011), suggesting the 
combined effect of somatic copy number alterations and DNA 
methylation patterns are likely to induce the diversity of gene 
expression profiles in cancer tissues. 

We merged the JRBs identified from 87 lung adenocarci- 
noma samples. This clearly showed that the blocks are not ran- 
domly distributed in the genome (Fig. 5C). For example, gene 
expression is frequently increased on the short arm of chromo- 
some 7, while expression is frequently decreased on the short 
arm of chromosome 3. These patterns correlate with frequent 



copy number alterations of cancer ge- 
nomes identified in previous studies (Weir 
et al. 2007; Job et al. 2010). 

Lymph node metastasis and TP53 
mutation 

We investigated the correlation of somatic 
alterations with lymph node metastasis 
(information for lymph node metastasis 
is available in Supplemental Table 1). We 
divided the cancer samples into two 
groups: those with known or candidate 
driver mutations (canonical point driver 
mutations [n = 47], canonical transforming 
fusion genes [n = 6], novel tyrosine kinase 
fusion genes [n = 4; CCDC6-R0S1, FGFR2- 
CIT, AXL-MBIP, SCAFll-PDGFRA], and 
MET exon 14 skipping [n = 3]) and those 
without. We performed multivariate lo- 
gistic regression for the presence or ab- 
sence of lymph node metastasis including 
gender, age, cancer stage, and smoking 
status as factors. Cancers with known or 
candidate driver mutations did not show 
a higher rate of lymph node metastasis 
than those without (P- value = 0.15; mul- 
tivariate logistic regression) (Supplemen- 
tal Table 12). However, cancer patients 
harboring both a known or candidate 
driver mutation and TP53 mutations 
showed significantly higher rates of lymph 
node metastasis (P-value = 0.017; multi- 
variate logistic regression) (Supplemental 
Table 12). This implies that activated on- 
cogenes and disrupted tumor suppressor 
genes such as TP53 may together con- 
tribute to cancer metastasis. In addition 
to driver and TP53 mutations, cancer 
stage showed significant association with lymph node metastasis 
(P-value = 0.00018; multivariate logistic regression) (Supplemental 
Table 12). 



Summary of driver mutations in lung adenocarcinoma 

We summarize the mutational profiles of the 200 lung adenocarci- 
nomas in Figure 6, including the results from transcriptome se- 
quencing and from screening tests (99 with EGFR mutations, six 
with KMAS mutations, and seven with EML4-ALK fusions). The fre- 
quencies of EGFR and KRAS mutations in Korean patients were 
60.5% {n = 121) and 12.0% {n = 24). Overall, -75.5% {n = 151/200) of 
lung adenocarcinomas are considered to be driven by point muta- 
tions in the 200 patients. In addition to point mutations, we found 
1 7 tissues with fusion protein tyrosine kinase genes (ALK, RET, ROSl, 
FGFR2, AXE, and PDGFRA; 10 from transcriptome sequencing and 
seven from FISH study), which comprises 8.5% of all samples. Three 
samples (1.5%) carried activating exon skipping of MET tyrosine 
kinase, suggesting that —10% of lung adenocarcinoma drivers are 
transcriptional variants that can be best investigated through tran- 
scriptome sequencing. Although we could not identify canonical 
driver mutations in 26 cancer tissues, we suggest some specific ab- 
errations of note for each individual tissue (Supplemental Table 13). 
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Figure 4. Mutational and transcriptional variation in cancer between never-smokers and smokers. {A) 
The number of somatic mutations (nonsynonymous single nucleotide and short-indel mutations) in the 
cancer tissue of each patient. Patients are classified into never-smokers and smokers, and further sorted 
by mutation count. (Inset) Box plot of somatic mutation counts for never-smokers and smokers. The two 
groups are significantly different (P = 0.001 079). (B) The proportion of the six possible nonsynonymous 
substitutions found within smokers and never-smokers. The two groups were significantly different with 
respect to transversions C > A and T > G (**, P < 0.001) and transversion T > A (*, P < 0.01). (C) The 
number of cancer-outlier genes (COGs; extremely high-expressed genes in a subset of cancer speci- 
mens; see Methods for details) in each cancer tissue. Patients are sorted as above. (Inset) Box plot 
showing that lung adenocarcinoma in smokers contains more cancer-outlier genes. (D) Gene expression 
within cancer tissues against average expression in normal tissue. Scatter plots for patients LC_S33 (a 
never-smoker patient) and LC_S51 (a smoker patient) are shown, providing an example of the variation 
in gene expression perturbation. Selected genes of interest are labeled. Genes were categorized as 
"Cancer-up" where generally overexpressed and "Cancer-down" where generally underexpressed in 
lung cancer compared with paired-normal tissue by hierarchical clustering (see Supplemental Material). 



Discussion 

The landscape of lung cancer genomes has been widely in- 
vestigated using genotyping microarray sequencing of targeted 
cancer genes, CGH array exome sequencing, and whole-genome 
sequencing (Ding et al. 2008; Beroukhim et al. 2010; Lee et al. 
2010). These studies provided the large repertoire of known ge- 



nomic abnormalities in cancer genes (i.e., 
EGFR and KRAS), identified critical path- 
ways (i.e., MAPK and PI3K pathway) for 
cancer transformation, and suggested pu- 
tative druggable targets to develop more 
efficient treatments (Herbst et al. 2008). 
The transcriptional landscape of lung 
adenocarcinomas, however, has not yet 
been widely explored, although fusion 
genes, alternative splicing events, and 
gene expression outliers may have critical 
roles in cancer transformation. In addi- 
tion, there are several unique character- 
istics of lung cancer in East Asians (Bell 
et al. 2008), including a large proportion 
of female patients, the predominance of 
adenocarcinoma over other types, and 
a high frequency of EGFR point muta- 
tions compared with Europeans. Investi- 
gation of such features may provide in- 
sights for the treatment or prevention of 
lung cancer in East Asians. Yet large-scale 
analysis of lung cancer genomes has not 
been performed in this ethnic group. 

In this study, we have extensively 
analyzed the transcriptomes of 87 lung 
adenocarcinomas in Korean patients. Ad- 
ditional whole-exome and transcriptome 
sequencing for the adjacent paired-nor- 
mal tissues was also performed to in- 
crease the specificity in identifying so- 
matic mutations. 

Transcriptome sequencing is a pow- 
erful tool to understand cancer because it 
captures a snapshot of diverse aspects of 
transformed cells. For instance, through 
whole-genome or whole-exome sequenc- 
ing we can check for the presence of 
somatic mutations in cancer. However, 
transcriptome sequencing also provides 
a picture of dynamic consequences rather 
than just the mutations themselves. We 
can profile gene expression levels, gene 
fusions, and alternative splicing events 
simultaneously, all of which contribute to 
the proliferation of cancer cells. More- 
over, RNA-seq is a very sensitive tool to 
identify point mutations. For example, 
six specimens which were negative for 
EGFR point mutations in the conven- 
tional screening test were discovered to 
harbor EGFR point mutations by tran- 
scriptome sequencing in this study. We 
believe that RNA sequencing is likely to 
outperform genome sequencing in de- 
tection of cancer driver mutations, especially when tumor purity is 
relatively low. Genes with driver mutations in cancer cells are likely 
to be more highly expressed than in normal cells, therefore en- 
hancing the signal-to-noise ratio in RNA-seq. For both approaches, 
it is important that systems are implemented which ensure the 
efficient collection and preservation of cancer tissues from clinic to 
bench. 
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Figure 5. JRBs identified from gene expression signatures. (A) Large JRBs observed on chromosome 5 
in one cancer sample (patient LC_S51 ) and its high correlation with CGH array results. (Top row) Relative 
expression levels of the genes on chromosome 5 (gray dots), their moving averages (red line), and 
detected JRBs (red horizontal bars). (Middle row) CGH array results for patient LC_S51 . Log2 ratio of 
probes (blue dots) and identified copy number alterations (blue horizontal bars). (Bottom row) Karyo- 
gram of chromosome 5. (B) Correlation between JRBs and CGH array data for three cancer specimens. 
The X-axis represents the averaged Z-scores of JRB and the /-axis indicates the averaged CGH array log2 
ratios for the genomic area. (C) The genomic location of JRBs and number of cancer tissues involved. 
Increased- and decreased-expression JRBs are shown in blue and red bars, respectively. 



Tumor lieterogeneity is an important issue in cancer genome 
studies (Marusylc and Polyalc 2010; Sliali et al. 2012). Cancer tissue 
specimens witli a low proportion of cancer cells require deeper 
read-depth when sequencing. We collected specimens from the 
center of tumors to attempt to obtain pure samples. The differences 
in numbers of somatic mutations per case as well as intrinsic var- 
iability in read allele frequency in RNA-seq (i.e., allele-specific ex- 
pression and expression differences between cells) do not allow 
accurate estimation of tumor heterogeneity from cancer tran- 
scriptome sequencing data. Ignoring these issues, the distribution 
of read allele frequencies in cancer transcriptome sequencing for 
somatic point mutations suggests that the purity of the 87 speci- 
mens studied is —80% on average (Supplemental Fig. 9). In addi- 
tion, to increase the sensitivity in detection of somatic mutations, 
we performed deep RNA-sequencing (—10 Gb was sequenced 
for each cancer specimen) and applied relatively tolerant criteria 
for variant detection (i.e., read allele ratio should be >10% for 
SNV detection). However, somatic mutations from clones with 
lower frequencies may be unidentified in this study. One point 
worthy of note is that transcriptome sequencing can only detect 
somatic mutations from genes active in transcription. Somatic 



mutations in nontranscribed genes can- 
not be detected. For example, a recent 
study reported that only —36% of somatic 
mutations were expressed in breast can- 
cers (Shah et al. 2012). However, this may 
be an advantage of transcriptome se- 
quencing because somatic driver muta- 
tions must be transcribed to have a func- 
tional impact on the progression of cancer 
cells. Transcriptome sequencing enables 
us to focus on functionally active somatic 
mutations, thus providing functional in- 
sights into cancer genomes. 

One of the aims of this project was to 
discover transforming fusion genes in 
lung adenocarcinomas. We estimated that 
studying 200 cancer tissues would pro- 
vide 95.1% power to detect transforming 
fusion genes with a frequency of 1.5% 
in lung adenocarcinoma (Supplemental 
Material). By utilizing transcriptome se- 
quencing technology, which is the most 
powerful method currently available to 
identify novel fusion genes (Maher et al. 
2009), we found that EML4-ALK, KIF5B- 
RET, and ROSl fusions are the three major 
transforming fusion genes of adenocar- 
cinoma in Koreans. We also identified 
novel gene fusions, including three pro- 
tein tyrosine kinase fusions. These novel 
fusion genes appear to be rare events in 
lung adenocarcinomas, because we iden- 
tified only a single case of each fusion 
gene among 200 samples. However, they 
may be good druggable targets, and more 
functional assessments of them are re- 
quired in further studies. 

From this study, we obtained evi- 
dence that expression levels of genes 
within a specific genomic region can 
be overexpressed (or underexpressed) to- 
gether in JRBs, which are likely affected by differentially methyl- 
ated regions (DMRs) or copy number alterations, which have been 
reported to be important in carcinogenesis (Beroukhim et al. 2010; 
Hansen et al. 2011). Our CGH array analysis confirmed that ge- 
nomic structural variations have a large-scale impact on the con- 
trol of gene expression levels within such blocks. Future integrative 
studies, combining genomic structural variations, epigenomic 
changes, and gene expression levels of cancers are necessary to 
understand the fine-scale mechanisms that control gene expres- 
sion in cancer cells. 

Finally, we observed that the expression landscapes of the 
cancer tissues were extremely heterogeneous. As seen in the so- 
matic point mutation and gene expression profile analyses, a sub- 
set of cancers harbored an extreme number of somatic point mu- 
tations and outlier genes. The pattern was unpredictable, but was 
not random and was associated with cigarette smoking. A recent 
study analyzing the impact of smoking on human normal lung 
tissues also supports our finding (Bosse et al. 2012). 

In summary, we have comprehensively identified the geno- 
mic and transcriptional aberrations underlying lung adenocarci- 
noma in Koreans. The successful discovery of many aberrations in 
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Figure 6. A summary of the mutational profiles of 200 lung adeno- 
carcinomas. Pie chart shows the distribution of driver mutations identified 
in 200 lung adenocarcinoma patients in this study. 



cancer genes, such as somatic mutations, gene fusions, alternative 
splicing events, and cancer outliers, is most likely due to the strong 
power and comprehensive nature of whole-transcriptome se- 
quencing. Our approach suggests a paradigm for large-scale deep 
transcriptome sequencing initiatives for a number of different 
cancer types. Our findings provide guidance for future trans- 
lational studies correlating characteristics of cancer tissues and 
clinical features, such as drug response, recurrence, and survival. 

Methods 

RNA and exome sequencing 

All protocols of this study were approved by the Institutional 
Review Board of Seoul National University Hospital (Approval # 
C-1111-102-387) and Seoul St. Mary's Hospital (Approval # 
KC11TISI0678). 

All the cancer and adjacent paired-normal tissue specimens 
used in this study were acquired from surgical specimens. Cancer 
and normal tissue specimens were grossly dissected and preserved 
immediately in liquid nitrogen after surgery. For RNA-seq, we 
extracted RNA from tissue using RNAiso Plus (Takara Bio Inc.), 
followed by purification using RNeasy MinElute (Qiagen Inc.). 
RNA was assessed for quality and was quantified using an RNA 
6000 Nano LabChip on a 2100 Bioanalyzer (Agilent Inc.). The 
RNA-seq libraries were prepared as previously described (Ju et al. 
2011). 

For exome sequencing of matched normal tissues (and cancer 
specimens LC_C5 and LC_C21 for validation purposes), genomic 
DNA was extracted from normal lung. Genomic DNA (3 |jLg) from 
each sample was sheared and used for the construction of a paired- 
end sequencing library as described in the protocol provided by 
Illumina. Enrichment of exonic sequences was then performed 
for each library using the SureSelect Human All Exon 50Mb Kit 
(Agilent Inc.) following the manufacturer's instructions. 

Libraries for RNA and exome sequencing were sequenced 
with Illumina TruSeq SBS Kit v3 on a HiSeq 2000 sequencer (Illu- 
mina Inc.) to obtain 100-bp paired-end reads. The image analysis 
and base calling were performed using the Illumina pipeline (vl.8) 
with default settings. 



Screening tests 

Screening genetic tests were performed for identification of three 
well-known driver mutations in a subset of the 200 lung adeno- 
carcinoma tissues as previously described: (1) Exon 18-21 of EGFR 
by PGR and Sanger sequencing (Lynch et al. 2004) {n = 164); (2) 
Exon 2 of KRAS by PGR and Sanger sequencing (Eberhard et al. 
2005) {n = 37); (3) EML4-ALK fusion genes by FISH (Kwak et al. 
2010) (n = 163). The results of these studies are summarized in 
Supplemental Material and Supplemental Table 1. 

Smoking history 

Of the 87 individuals whose cancer specimens were RNA se- 
quenced, smoking history before diagnosis of lung cancer was 
provided by 83 (47 smokers, 36 never-smokers, and four un- 
knowns). Information about the amount of smoking (pack-years) 
was available for 23 out of 47 smokers. 



Sequence analyses 

RNA and exome sequencing reads were aligned to the NGBI hu- 
man reference genome assembly (build 37.1) using GSNAP (Wu 
and Nacu 2010) with allowance for 5% mismatches. In the same 
manner, the RNA sequencing reads were also aligned to a cDNA set 
consisting of 161,250 mRNA sequences obtained from public da- 
tabases (36,742 RefSeq, 73,671 UGSG, and 161,214 Ensembl) to 
decrease the false positives and false negatives in variant detection 
from RNA sequencing data (Ju et al. 2011, 2012). The expression 
levels for 36,742 RefSeq genes were measured by uniquely aligned 
RNA sequencing reads. For each gene, the number of reads aligned 
to it (raw read count) was normalized by RPKM (reads per kilobase 
per million mapped reads) (Mortazavi et al. 2008). 

Somatic single nucleotide and short-indel discovery 

We first identified single nucleotide variations (SNVs) and short- 
indel variants in cancer using the transcriptome sequencing data. 
To minimize false positive calls generated by misalignment, we 
used variant calls commonly identified from both the genome and 
the mRNA alignment. SNVs were defined according to the fol- 
lowing three conditions: (1) The number of uniquely mapped 
reads at the position should be >3; (2) the average base quality for 
the position should be >20; (3) the allele ratio at the position 
should be >10% for SNVs. Indels were called by the same pro- 
cedure. Gene annotations for the variants were done using RefSeq 
genes. To identify somatic mutations, we removed SNVs and indels 
that were also identified in the normal tissue counterparts (76 
whole-exome and 77 transcriptome sequencing). To remove po- 
tential germline variants that might not be filtered by this step, 
common germline SNPs that exist in normal human populations 
(variants identified from phase I of The 1000 Genomes Project [The 
1000 Genomes Project Gonsortium 2010], variants with minor 
allele frequency >1% from dbSNP 132 [Altshuler et al. 2010], and 
variants identified in normal Korean individuals [Ju et al. 2011]) 
were assumed to be unrelated to cancer transformation and were 
removed. These filtration steps might fail to remove some rare 
germline variants if the position was insufficiently covered 
in normal exome and transcriptome sequencing. However, we 
mostly focused on recurrent mutations in the later analysis, which 
are highly unlikely to be rare germline variants. Of note, among 
the 87 cancer specimens where RNA sequencing was performed, 1 1 
did not have whole-exome sequencing of normal counterparts. 
Therefore, our lists of somatic mutations for the 1 1 cancer speci- 
mens are likely to include more rare germline variants than those 
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of the other specimens. We did not consider the 11 specimens in 
the statistical analysis of the number of somatic mutations and 
smoking history of lung cancer patients. 

Fusion gene detection 

We detected in-frame fusion genes by utilizing the GFP software 
described in our previous report (Ju et al. 2012). Briefly, the method 
makes use of discordant read pairs, where two ends of a pair are 
mapped to different genes and exon-spanning reads across the 
exonic fusion breakpoint of chimeric transcripts. Additional fil- 
tration cascades (homology filter, fusion-spanning read filter, and 
the fusion point filter) were applied to remove false positives (Ju 
et al. 2012). In addition to GFP analysis, we also assessed read depth 
along each exon of tyrosine kinases (e.g., ALK, RET, and ROSl), 
since abrupt overexpression after fusion gene breakpoints is 
a hallmark of fusion for these genes (Supplemental Fig. 3; Ju et al. 
2012; Lipson et al. 2012). We do not report intrachromosomal 
fusions between adjacent genes (<200 kb) because these are as- 
sumed to be due to read-through transcription and so do not 
originate in genomic rearrangement, e.g., translocation, inversion, 
or large deletion (Ju et al. 2012). Finally, off-frame fusion genes 
were removed. 

Alternative splicing 

First, exon-skipping reads were extracted from the collection of 
sequencing reads for each cancer sample. The GSNAP alignment 
tool conveniently allows reads to be split into two segments and 
mapped to different exons when genomic positions of exons are 
provided to the program. We collected those spliced sequencing 
reads where nonadjacent exons of a gene were joined and defined 
them as exon-skipping sequencing reads. For candidate skipped 
exons supported by at least two exon-skipping reads, we obtained 
the expression levels for the candidate exon and its neighboring 
exons in terms of RPKM. Since the expression level for an exon is 
correlated with the expression level for the gene from which it 
derives, the expression of the exons was normalized by the ex- 
pression level of the gene to which they belong. Next, the fold 
changes between the 5' neighboring and the 3' neighboring exons 
and the candidate exon were calculated and averaged. With an 
assumption of normal distribution, a Z-value for the fold change 
was obtained by considering all the fold changes calculated in the 
same manner in all the normal samples (n = 11). If any of the 
normal averaged fold changes was <0.9, the candidate skipped 
exon was not further considered because dropped coverage was 
not specific to cancer. 

Differentially expressed gene [DEG] selection 

We selected DEGs by clustering genes by expression levels. First, 
genes with either RPKM <1.0 or coefficient of variation (CV) <0.7 
were excluded to remove genes noninformative for clustering. This 
resulted in a total of 3051 unique genes. Log2 transformation and 
additional row-wise and column-wise normalization was applied: 
Row-wise normalization was applied to each gene by subtracting 
the gene's median expression value from individual expression 
values. Similarly, column-wise normalization was applied to each 
sample by subtracting the sample's median expression value from 
individual expression values. This guarantees that row- wise and 
column-wise median expression values were both set to zero. 
Then, hierarchical clustering was done by Gene Cluster 3.0 with 
default parameters (de Hoon et al. 2004), correlation (uncentered), 
and complete linkage. A heatmap was drawn by R package function 
heatmap . 2 (http : //cran . r-pro j ect.org/web/packages/gplots/index. 



html). Finally, we referred to the hierarchical tree generated by the 
clustering process and selected three types of DEGs (cancer-UP, 
cancer-DOWN, and mixed). 

Cancer outlier gene analysis 

COG analysis was performed for 22,427 RefSeq genes across a total 
of 164 RNA-seq results with modified criteria suggested previously 
(Tomlins et al. 2005; MacDonald and Ghosh 2006). The analysis 
pipeline is as follows. (1) All the expression values are subtracted by 
their median, which sets the gene's median to zero (location nor- 
malization). (2) All the expression values are then divided by their 
median absolute deviation (MAD) (scale normalization). (3) Given 
a set of normalized expression values, q75 + 3 X IQR is defined as 
an outlier cutoff where q75 is the 75*^ percentile expression value 
and IQR (inter quartile range) is the absolute difference between 
the 25*^ and the 75*^ percentile expression values. An expression 
value is accepted as an outlier when its original RPKM is > 1.0 and 
its normalized expression value exceeds the estimated outlier 
cutoff. (4) Genes showing an outlier pattern in any normal samples 
are excluded. (5) Finally, genes that exhibit an outlier pattern in at 
least one cancer sample are chosen as candidate COGs. 

Jointly regulated block QRB) identification 

To generate the gene expression signatures of each cancer sample, 
we calculated the ratio of gene expression levels in a cancer tissue 
to the average expression levels in 77 adjacent normal tissues. The 
gene set for analysis was formed from 36,742 transcripts in the 
RefSeq database, which yielded 22,427 genes after filtering out 
redundant entries. 

To quantify relative expression among cancer samples, we 
compared the gene expression ratio of a gene (gene A) in the i* 
cancer sample with the ratio in all 87 cancer tissues and calculated 
a normalized Z-score as follows: 

/ _ expression ratiOi^A - average expression ratiOA\ 



where SD is the standard deviation of expression ratios. 

During this process, genes with low expression levels (maxi- 
mum expression <3 RPKM) or genes with small variance in ex- 
pressions (relative standard variation <0.1) were removed, since 
signal-to-noise ratio for these genes is not sufficient. This left 
16,419 genes for further investigation. 

Given a set of Z-scores, we calculated the moving average of 
10 Z-scores by walking through the chromosomes. An increased 
expression JRB is defined as starting at a gene with a Z-score >1.5 
and extending in both directions until a Z-score <0.5 is reached. 
Once its boundary is determined, the JRB must satisfy at least one 
of the three following criteria: (1) more than 40 genes within 
a block, (2) more than 20 genes in a block and an average Z-score 
>1.2, (3) an average Z-score >2.0. On the other hand, we applied 
slightly different conditions in discovering decreased expression 
JRBs to increase the sensitivity. A decreased expression JRB is de- 
fined as starting at a gene with a Z-score less than -1.0 and 
extending in both directions until a Z-score greater than -0.5 is 
reached. Then the JRB must satisfy at least one of the three fol- 
lowing criteria: (1) more than 40 genes within a block, (2) more 
than 20 genes in a block with an average Z-score less than -0.8, (3) 
an average Z-score less than -1.0. 

The comparison of JRBs and copy number alterations pro- 
vided by CGH array was done by calculating the correlation (r^) 
between the averaged Z-scores of JRBs and the averaged log2 ratio 
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values of probes within the same JRBs. CGH array analyses were 
performed using nine cancer samples (LC_C7, LC_C21, LC_C25; 
LC_C35, LC_S19, LC_S23, LC_S39, LC_S42, andLC_S51) and their 
normal counterparts using a customized Agilent 180K CGH array 
platform (Park et al. 2010). CGH array experiments were con- 
ducted according to the manufacturer's instructions (Agilent Inc.). 
Briefly genomic DNA from cancer and adjacent-normal specimens 
were labeled by Cy-5 and Cy-3 dye, respectively and hybridized. 
Log2 ratio was calculated by image analysis of CGH array using the 
CGH-105_Jan09 protocol (Agilent Inc.) for background sub- 
traction and normalization. 

Validation of fusion genes 

Fusion genes were validated using PGR amplification of fusion 
gene breakpoints of chimeric cDNA and Sanger sequencing. The 
PGR reactions were 10 min at 95°C, 30 cycles of 30 sec at 95°C, 30 
sec at 62°C and 30 sec at 72°C, and finally 10 min at 72°C. All the 
Sanger sequencing experiments were performed at Macrogen Inc. 
(http://www.macrogen.com). PGR and Sanger sequencing primers 
are shown in Supplemental Table 7. 

Statistical analyses 

The differences in number of somatic mutations and GOGs be- 
tween smokers and never-smokers (Fig. 4A,G) were tested using 
Student's f-test. x' tests were applied on the difference in mutation 
spectrums (Fig. 4B). Logistic regression analysis was performed to 
assess the relationship between somatic mutations (known or 
candidate driver mutations and TP53 mutations) and lymph node 
metastasis using gender, age, cancer stage, and smoking status as 
covariates (Supplemental Table 12). Two-sided P-values were cal- 
culated for all these statistical tests. 

Mutual exclusion and concurrence analysis 

We carried out pairwise mutual exclusion and concurrence anal- 
ysis for genes that showed more than three mutations (including 
somatic point mutations, fusion genes, and skipped exons). For 
a given pair of mutated genes A and B, we recorded the number of 
samples in possible four categories (A mutated only, B mutated 
only, both A and B mutated, and neither). Then Fisher's exact test 
was performed to infer the mutual dependency between the two 
genes. Once two genes were determined to be mutually dependent 
on each other by the test, their mutual exclusion/concurrence 
was determined by calculating Pearson's correlation coefficient, 
r (mutual exclusion: r < 0 and concurrence: r > 0). 

Data access 

Gene expression values for 87 lung adenocarcinomas and 77 
adjacent normal tissues can be viewed at http://gene.gmi.ac.kr 
and at the NGBI Gene Expression Omnibus (GEO) (http:// 
www.ncbi.nlm.nih.gov/geo/) under accession number GSE40419. 
Transcriptome and exome sequencing data are uploaded to the FBI 
European Nucleotide Archive (http://www.ebi.ac.uk/ena/home) un- 
der accession number ERP001058 (transcriptome sequencing) and 
ERP001575 (exome sequencing). 
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